Oligo-distance: a Sequence Distance Determined by Word Frequencies

Authors

  • L. C. HSIEH
  • LIAOFU LUO
  • FENGMIN JI
Abstract

Differences in the frequencies of chemical words of a given length in two nucleic sequences are used to define an “oligo-distance” between the sequences. Oligo-distances are much easier and faster to compute than the distances conventionally determined by sequence alignment. A correlation between oligo-distance and alignment-distance is observed. The two kinds of distances are used to construct phylogenetic trees for artificially generated sequences and for a set of thirty-five 16S and 18S rRNA sequences. The gross topologies of the trees given by the two kinds of distances are identical when the sequences are complete, but only the oligo-distance is robust against sequence deformations such as rearrangement, truncation and random concatenation.
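The abstract does not give the exact formula, so the following is only a minimal sketch of the general idea. It assumes overlapping words of length `k`, relative frequencies, and a sum of absolute frequency differences as the distance. The function names `word_frequencies` and `oligo_distance` and the default `k=3` are illustrative choices, not taken from the paper.

```python
from collections import Counter


def word_frequencies(seq, k):
    """Relative frequencies of all overlapping words of length k in seq."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence is shorter than the word length k")
    return {word: n / total for word, n in counts.items()}


def oligo_distance(seq_a, seq_b, k=3):
    """Assumed distance: sum of absolute differences in word frequencies.

    This L1 form is an assumption made for illustration; the paper's
    actual definition may use a different norm or normalisation.
    """
    freq_a = word_frequencies(seq_a, k)
    freq_b = word_frequencies(seq_b, k)
    words = set(freq_a) | set(freq_b)
    return sum(abs(freq_a.get(w, 0.0) - freq_b.get(w, 0.0)) for w in words)
```

Because only word counts enter the calculation, no alignment is needed, and rearranging large blocks of a sequence changes only the few words that span the block boundaries.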

Similar resources

Spaced words and kmacs: fast alignment-free sequence comparison based on inexact word matches

In this article, we present a user-friendly web interface for two alignment-free sequence-comparison methods that we recently developed. Most alignment-free methods rely on exact word matches to estimate pairwise similarities or distances between the input sequences. By contrast, our new algorithms are based on inexact word matches. The first of these approaches uses the relative frequencies of...

Alignment-free distance measure based on return time distribution for sequence analysis: applications to clustering, molecular phylogeny and subtyping.

The data deluge in the post-genomic era demands development of novel data mining tools. Existing molecular phylogeny analyses (MPAs) developed for individual gene/protein sequences are alignment-based. However, the size of genomic data and uncertainties associated with alignments necessitate development of alignment-free methods for MPA. Derivation of distances between sequences is an important st...

The Intellectual Structure of Knowledge in the Field of Distance Education Using the Co-Word analyses

Background: Co-word analysis is one of the content analysis methods used in scientometric studies and in mapping the scientific structure of various fields. The purpose of the present research is to map the structure of distance education using co-word analysis. Methods: The research method is content analysis using co-word analysis. The research population comprises 31607 documents indexed in the...

Statistical measures of DNA sequence dissimilarity under Markov chain models of base composition.

In molecular biology, the issue of quantifying the similarity between two biological sequences is very important. Past research has shown that word-based search tools are computationally efficient and can find some new functional similarities or dissimilarities invisible to other algorithms like FASTA. Recently, under the independent model of base composition, Wu, Burke, and Davison (1997, Biom...

Phylogenetic Analysis of Some Luffa Genotypes Based on the sequence of intergenic region of trnH-psbA

Luffa (Luffa cylindrica) is a plant from the Cucurbitaceae family that grows mostly in tropical and subtropical regions, as well as in most regions of Iran. In this research, the genetic diversity of nine native and non-native genotypes of L. cylindrica was investigated through the evaluation of the chloroplast trnH-psbA intergenic region (IGS). After sampling the young leaves, DNA extraction w...

Publication date: 2004